Automated Name Selection for the Network Scale-up Method

Adrià Fenoy

Autonomous University of Barcelona

Michał Bojanowski

Autonomous University of Barcelona &
Kozminski University

Miranda J. Lubbers

Autonomous University of Barcelona

2023-05-02

How many people do you know?

  • Network Scale-Up Method for estimating network degree
  • Choosing subpopulations well is not straightforward
  • We propose a solution!

Outline

  1. Network Scale-Up Method
  2. Challenge of choosing subpopulations
  3. The Solution

Network Scale-Up Method

NSUM = Survey instrument + Statistical model

  • Estimating personal network sizes
  • Estimating size(s) of hidden populations

NSUM – survey instrument

How many people do you know who are members of [subpopulation]?

Dichotomous (Baum and Marsden 2023)

Do you know anyone who is a member of [subpopulation]?

NSUM – survey instrument

BRIDGES survey

Now I will ask you about the people you know in Spain in general. I will ask about the people you know with certain characteristics. By knowing someone we understand that you know the first name of this person and you would recognize one another if you ran into them for example in the street, in a shop, or in another place. This includes both close relationships such as your partner, family, friends, neighbors, coworker or fellow students and less close relationships, such as for example people whom you have met in the associations to which you belong or who you know via other people.

These people do not have to live near you, you can also be in contact with them through social media or by phone. You may like them or not. Please do not include deceased persons, people under 18 years old, nor yourself.

How many people over the age of 18 do you know (by name and by sight) who have the following jobs, whether they are women or men?

NSUM – estimation and modeling

  • Aggregated Relational Data (ARD)
    • \(y_{ik}\) – number of persons from subpopulation \(k\) “known” to \(i\)
  • Types of parameters estimated
    • Estimated degree \(d_i\)
    • Overdispersion
    • Effects and biases
  • MLE (Killworth et al. 1990)
  • Bayesian hierarchical models
    • Zheng, Salganik, and Gelman (2006)
    • McCormick, Salganik, and Zheng (2010)
    • DiPrete et al. (2011)
    • Feehan et al. (2016)
    • Laga, Bao, and Niu (2021a)
    • Laga, Bao, and Niu (2021b) (a review)
    • Baum and Marsden (2023)
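
As a rough illustration of the simplest estimator in this family, the sketch below computes the basic scale-up degree estimate \(d_i = N \sum_k y_{ik} / \sum_k N_k\) from ARD, where \(N_k\) is the known size of subpopulation \(k\) and \(N\) is the total population size. The function name `scale_up_degree` and all numbers are made up for illustration; this is not the BRIDGES data, and the Bayesian models listed above are considerably richer.

```python
import numpy as np

def scale_up_degree(ard, subpop_sizes, population_size):
    """Basic scale-up degree estimate for each respondent.

    ard: (n_respondents, n_subpops) array, ard[i, k] = y_ik
    subpop_sizes: (n_subpops,) known subpopulation sizes N_k
    population_size: total population size N
    """
    ard = np.asarray(ard, dtype=float)
    subpop_sizes = np.asarray(subpop_sizes, dtype=float)
    # d_i = N * (sum_k y_ik) / (sum_k N_k)
    return population_size * ard.sum(axis=1) / subpop_sizes.sum()

# Hypothetical toy data: 2 respondents, 3 subpopulations
ard = [[1, 0, 2], [0, 0, 1]]
subpop_sizes = [1000, 2000, 3000]
degrees = scale_up_degree(ard, subpop_sizes, population_size=1_000_000)
```

The first toy respondent reports 3 acquaintances across subpopulations that together make up 0.6% of the population, which scales up to an estimated degree of 500.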

NSUM – subpopulations, biases and effects

  • First names
  • Occupational groups
  • Ethnic groups
  • Intersections with demographic groups
  • Transmission error – ego is not aware that alter belongs to subpopulation
  • Barrier effect – some egos systematically know more/fewer members of a subpopulation than under random mixing
  • Recall error – ego cannot recall everyone they know

Choosing subpopulations for NSUM

Subpopulations for degree estimation

McCormick, Salganik, and Zheng (2010) suggest:

  1. Use first names, provided that population name statistics are available
  2. Use names with a prevalence of 0.1% to 0.2% – a good tradeoff between recall error and estimation precision
  3. Use a subset of names for which the joint distribution of traits such as gender and age is similar to that of the overall population
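
The prevalence rule in point 2 amounts to a simple filter on population name statistics. A minimal sketch, assuming name counts are available as a dictionary; the function name `names_in_prevalence_band`, the names, and the counts are all hypothetical:

```python
def names_in_prevalence_band(name_counts, population_size, low=0.001, high=0.002):
    """Return names whose population share lies in [low, high]."""
    return sorted(
        name
        for name, count in name_counts.items()
        if low <= count / population_size <= high
    )

# Hypothetical counts in a population of 1,000,000
name_counts = {"Ana": 15_000, "Bruno": 1_500, "Carla": 1_200, "Dario": 300}
candidates = names_in_prevalence_band(name_counts, 1_000_000)
```

Here "Ana" is too common (1.5%) and "Dario" too rare (0.03%), so only the two names inside the 0.1% to 0.2% band remain as candidates.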

Subpopulations for degree estimation

Another area for future methodological work is formalizing the procedure used to select names that satisfy the scaled-down condition. Our trial-and-error approach worked well here because there were only eight alter categories, but in cases with more categories, a more automated procedure would be preferable.
(McCormick, Salganik, and Zheng 2010)

Solution

The problem

Number of persons in the population with name \(i\) belonging to category \(j\)

\[f_i^j\]

Marginal distribution of traits in population

\[f^j = \frac{\sum_{i} f_i^j}{\sum_{i} \sum_k f_i^k}\]

Marginal distribution of traits in selected subset \(S\) of names

\[\hat{f}^j = \frac{\sum_{i \in S} f^j_i}{\sum_{i \in S} \sum_k f^k_i}\]
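
A small sketch of these two quantities, assuming the counts \(f_i^j\) are stored as a names-by-categories array. Both the population and the subset marginals are normalised to proportions here so that they can be compared directly. The function name `trait_shares` and the counts are hypothetical:

```python
import numpy as np

def trait_shares(counts, selected=None):
    """Share of each trait category among people whose name is selected.

    counts: (n_names, n_categories) array, counts[i, j] = f_i^j
    selected: iterable of row indices (the subset S); None means all names
    """
    counts = np.asarray(counts, dtype=float)
    if selected is not None:
        counts = counts[list(selected)]
    totals = counts.sum(axis=0)      # sum over names of f_i^j
    return totals / totals.sum()     # normalise so the shares sum to 1

# Hypothetical counts: 3 names x 2 categories (e.g. women, men)
counts = [[90, 10], [20, 80], [50, 50]]
population = trait_shares(counts)         # all names
subset = trait_shares(counts, [0, 1])     # S = {name 0, name 1}
```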

The problem

\[\arg\min_S \sum_j D(f^j, \hat{f}^j)\]

Given the…

  • distribution of names and traits in the population (\(f_i^j\))

… find a subset \(S\) of names for which the selected distance measure \(D(\cdot)\) comparing…

  • the distribution of traits in the subset to the
  • distribution of traits in the population

… is as small as possible.

Distance measures

  • Kullback-Leibler divergence: \(D_{KL}(f^j, \hat{f}^j) = f^j \log \left(\frac{f^j}{\hat{f}^j}\right)\)
  • Jensen-Shannon divergence: \(D_{JS}(f^j, \hat{f}^j) = \frac{1}{2} D_{KL}\left(f^j, \frac{f^j + \hat{f}^j}{2}\right) + \frac{1}{2} D_{KL}\left(\hat{f}^j, \frac{f^j + \hat{f}^j}{2}\right)\)
  • Absolute distance: \(D_{L1}(f^j, \hat{f}^j) = \left| f^j - \hat{f}^j \right|\)
  • Quadratic distance: \(D_{L2}(f^j, \hat{f}^j) = \left( f^j - \hat{f}^j \right)^2\)
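
The four measures can be sketched as follows, with each function already summing the per-category terms over \(j\). This is an illustrative implementation under the assumption that both inputs are proportion vectors over the same categories; the function names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence summed over categories.

    Terms with p == 0 contribute 0; q must be positive wherever p is.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: symmetrised KL against the midpoint."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def l1(p, q):
    """Sum of absolute differences."""
    return float(np.sum(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))))

def l2(p, q):
    """Sum of squared differences."""
    return float(np.sum((np.asarray(p, dtype=float) - np.asarray(q, dtype=float)) ** 2))

# Hypothetical population and subset trait shares
population = [0.5, 0.5]
subset = [0.55, 0.45]
distances = {d.__name__: d(population, subset) for d in (kl, js, l1, l2)}
```

Unlike the Kullback-Leibler divergence, the other three measures are symmetric in their arguments, and the Jensen-Shannon divergence stays finite even when a category is empty in the subset.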

Quadratic problem formulation

Let

\[\alpha = \frac{1}{\sum_{i \in S} \sum_k f^k_i}\]

then

\[\arg\min_S \sum_j \left(f^j - \alpha \sum_i f^j_i x_i \right)^2\]

where \(x_i=1\) if name \(i\) is in the subset and \(x_i = 0\) otherwise
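
To make the objective concrete, the sketch below evaluates it for a 0/1 selection vector and finds the best subset of a fixed size by exhaustive search. Exhaustive search is used here only because it is easy to verify on a toy example; it is not the algorithm proposed in this talk and does not scale beyond a handful of candidate names. The function names, the fixed-size constraint, and the counts are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np

def quadratic_objective(counts, x):
    """sum_j (f^j - alpha * sum_i f_i^j x_i)^2 for a 0/1 selection vector x."""
    counts = np.asarray(counts, dtype=float)
    x = np.asarray(x, dtype=float)
    population = counts.sum(axis=0) / counts.sum()   # f^j as proportions
    selected_totals = x @ counts                     # sum_i f_i^j x_i
    alpha = 1.0 / selected_totals.sum()              # normalising constant
    return float(np.sum((population - alpha * selected_totals) ** 2))

def best_subset(counts, size):
    """Exhaustive search over all subsets of a given size (small inputs only)."""
    n = len(counts)
    best = None
    for subset in combinations(range(n), size):
        x = np.zeros(n)
        x[list(subset)] = 1
        value = quadratic_objective(counts, x)
        if best is None or value < best[1]:
            best = (subset, value)
    return best

# Hypothetical counts: 4 names x 2 trait categories
counts = [[90, 10], [20, 80], [100, 100], [30, 70]]
subset, value = best_subset(counts, size=2)
```

Selecting every name reproduces the population distribution exactly, so the objective is zero in that case; the search is only interesting under a constraint such as a fixed number of names.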

Algorithm

Illustration

Countries

Candidate names

Hungary

Belgium

Thanks!

More info: http://coalesce-lab.com/en

Baum, Derick S., and Peter V. Marsden. 2023. “Uses and Limitations of Dichotomous Aggregate Relational Data.” Social Networks 74: 42–61.
DiPrete, Thomas A., Andrew Gelman, Tyler McCormick, Julien Teitler, and Tian Zheng. 2011. “Segregation in Social Networks Based on Acquaintanceship and Trust.” American Journal of Sociology 116 (4): 1234–83. https://doi.org/10.1086/659100.
Feehan, Dennis M., Aline Umubyeyi, Mary Mahy, Wolfgang Hladik, and Matthew J. Salganik. 2016. “Quantity Versus Quality: A Survey Experiment to Improve the Network Scale-up Method.” American Journal of Epidemiology 183 (8): 747–57.
Killworth, Peter D., Eugene C. Johnsen, H. Russell Bernard, Gene Ann Shelley, and Christopher McCarty. 1990. “Estimating the Size of Personal Networks.” Social Networks 12 (4): 289–312. https://doi.org/10.1016/0378-8733(90)90012-X.
Killworth, Peter D., Eugene C. Johnsen, Christopher McCarty, Gene Ann Shelley, and H. Russell Bernard. 1998. “A Social Network Approach to Estimating Seroprevalence in the United States.” Social Networks 20 (1): 23–50. https://doi.org/10.1016/S0378-8733(96)00305-X.
Killworth, Peter D., Christopher McCarty, H. Russell Bernard, Gene Ann Shelley, and Eugene C. Johnsen. 1998. “Estimation of Seroprevalence, Rape, and Homelessness in the United States Using a Social Network Approach.” Evaluation Review 22 (2): 289–308. https://doi.org/10.1177/0193841X9802200205.
Laga, Ian, Le Bao, and Xiaoyue Niu. 2021a. “A Correlated Network Scale-up Model: Finding the Connection Between Subpopulations.” https://arxiv.org/abs/2109.10204.
———. 2021b. “Thirty Years of The Network Scale-up Method.” Journal of the American Statistical Association, 1–33.
McCormick, Tyler H., Matthew J. Salganik, and Tian Zheng. 2010. “How Many People Do You Know? Efficiently Estimating Personal Network Size.” Journal of the American Statistical Association 105 (489): 59–70. https://doi.org/10.1198/jasa.2009.ap08518.
Zheng, Tian, Matthew J. Salganik, and Andrew Gelman. 2006. “How Many People Do You Know in Prison? Using Overdispersion in Count Data to Estimate Social Structure in Networks.” Journal of the American Statistical Association 101 (474): 409–23. https://doi.org/10.1198/016214505000001168.